Substudies of the Childhood Asthma Management Program [Control. Clin.
Trials 20 (1999) 91-120; N. Engl. J. Med. 343 (2000) 1054-1063] seek to
identify patient characteristics associated with asthma symptoms and
lung function. To determine if genetic measures are associated with
trajectories of lung function as measured by forced vital capacity (FVC),
children in the primary cohort study retrospectively had candidate loci
evaluated. Given participant burden and constraints on financial
resources, it is often desirable to target a subsample for ascertainment
of costly measures. Methods that can leverage the longitudinal outcome on
the full cohort to selectively measure informative individuals have been
promising, but have been restricted in their use to analysis of the
targeted subsample. In this paper we detail two multiple imputation
analysis strategies that exploit outcome and partially observed covariate
data on the nonsampled subjects, and we characterize alternative design
and analysis combinations that could be used for future studies of
pulmonary function and other outcomes. Candidate predictor (e.g., IL10
cytokine polymorphisms) associations obtained from targeted sampling
designs can be estimated with very high efficiency compared to standard
designs. Further, even though multiple imputation can dramatically
improve estimation efficiency for covariates available on all subjects
(e.g., gender and baseline age), relatively modest efficiency gains were
observed in parameters associated with predictors that are exclusive to
the targeted sample. Our results suggest that future studies of
longitudinal trajectories can be efficiently conducted by use of
outcome-dependent designs and associated full cohort analysis.

1. Introduction. The Childhood Asthma Management Program [CAMP;
CAMP Research Group (1999 and 2000)] was a randomized clinical trial that
compared two anti-inflammatory medications and a placebo on lung growth
over the course of 4 years in children with mild to moderate asthma. CAMP
substudies have since examined the relationship between genetic factors and
asthma phenotypes. Like other genetic data collected in CAMP,
interleukin-10 (IL10) genotype data were obtained retrospectively by
analysis of stored blood samples. IL10 is a type-2 T-helper cytokine with
anti-inflammatory properties, and polymorphisms in the IL10 cytokine gene
have been shown to be associated with asthma phenotypes in children
[Lyon et al. (2004)]. However, as is often the case, ascertainment of
expensive exposures can restrict sample size and therefore motivate
thoughtful sampling strategies. Given that the outcome of interest was
available on all subjects, we seek to determine whether the longitudinal
response could or should be used to target a subset of select individuals
for sampling of covariates. In particular, we explore both sampling
designs and associated analysis options with the goal of providing
recommendations for the efficient conduct of future retrospective studies.

We are specifically interested in the impact genetic variants have on both
lung function and growth, and on the effect of medication (versus placebo)
within subgroups defined by genetic variants of the IL10 gene. For nearly
all children, forced vital capacity (FVC, a measure of lung function) was
measured ten times over the course of 4 years, thereby providing rich
detail on the primary response trajectory. Our scientific question can be
addressed by appropriate longitudinal regression models with a focus on
estimating the main effects of time since randomization, time-invariant
randomized treatment assignment (Budesonide, Nedocromil, placebo), and
their interactions with the presence or absence of at least one IL10
polymorphism. Valid IL10 and other data were available for 555 children
who participated in CAMP. Even though all data were available for these
children, we will illustrate the interplay between sampling strategies and
analysis procedures by assuming study resources are limited and IL10 data
can only be collected on approximately 250 children. The assumption of
limited resources allows us to compare and contrast several sampling
designs and estimation procedures in order to inform decisions when
conducting similar substudies in the future.

In related work, Neuhaus, Scott and Wild (2002, 2006) discussed biased,
outcome dependent sampling (ODS) designs with longitudinal response data
and estimation from resulting data using a profile likelihood. In the
longitudinal binary response setting, Schildcrout and Heagerty (2008, 2011)
described stratified sampling designs based on the sum of the response
series with an ascertainment corrected likelihood approach for analysis.
Schildcrout and Rathouz (2010), Schildcrout et al. (2012) and Neuhaus et al.
(2014) addressed auxiliary variable dependent sampling where the sampling
variable is related but is not equal to the longitudinal response. In the
univariate continuous response setting, Zhou et al. (2002, 2007) and Weaver
and Zhou (2005) discussed ODS designs that combine simple random samples
with a sample of subjects whose responses are more extreme. Further,
several authors discussed unplanned outcome-dependent follow-up for
longitudinal continuous response data [e.g., Lin and Ying (2001), Lipsitz
et al. (2002), Buzkova and Lumley (2009)].

In Schildcrout, Garbett and Heagerty (2013), we proposed biased
epidemiological study designs for continuous longitudinal response data
where sampling is based on strata defined by low-dimensional summaries of
the response series. We proposed sampling based on the intercept, the
slope, or both the intercept and slope of the subject-specific ordinary
least squares (OLS) regressions of the response on a time-varying covariate
(which may be time itself). We showed that sampling based on a variable
related to a target predictor can lead to substantial efficiency gains
relative to random sampling for the associated parameter. Such a result is
well known to survey sampling methodologists [e.g., see Kish (1965), Korn
and Graubard (2011)]. The estimation procedure discussed in Schildcrout,
Garbett and Heagerty (2013) used a bias correcting, ascertainment corrected
conditional likelihood that only includes subjects with fully observed
exposure data (i.e., those who were sampled). Such an analysis can be
referred to as a complete data (CD) analysis [Carroll et al. (2006),
Lawless, Kalbfleisch and Wild (1999)]. In univariate response settings,
such as the case-cohort design, other authors [e.g., Breslow et al. (2009a,
2009b), Marti and Chavance (2011)] have shown that utilizing the partial
data on the unsampled subjects can add information and improve estimation
efficiency.

With specific motivation from the CAMP study, the purpose of this
manuscript is to detail the joint impact of sampling design and statistical
analysis decisions toward efficient parameter estimation with longitudinal
continuous response data. Longitudinal outcome-dependent sampling designs
have only recently been proposed, and analysis options have not considered
use of both sampled and unsampled subjects. Using the CAMP study for
motivation and illustration, we focus on the following goals: (1) to
evaluate circumstances under which multiple imputation (MI) increases
efficiency appreciably over the bias-correcting complete data (CD) analysis
under ODS designs, and (2) to evaluate the extent to which the ODS designs
improve estimation efficiency when MI (rather than CD analysis) is the
chosen analytical approach. We use a simulation study to explore relative
efficiency across several sampling design and estimation procedure
combinations. The CAMP study is an exemplar of a longitudinal randomized
trial in which retrospective collection of additional explanatory data is
conducted in order to leverage the original cohort study and answer new
scientific questions. The CAMP data provide an ideal context to inform
efficient study design options for future ancillary studies of factors
associated with longitudinal outcome trajectories.

Section 2 discusses the model of interest, briefly reviews the sampling
strategy and estimation procedure discussed in Schildcrout, Garbett and
Heagerty (2013), and proposes two multiple imputation analysis strategies
that exploit the unsampled subjects' data. Section 3 examines the relative
efficiency of design and analysis procedures in a number of plausible
scenarios. Section 4 returns to the CAMP data to examine the impact of
study designs on the FVC data, and Section 5 provides a discussion
including directions for future research.

2. Methodological framework. We now introduce the mixed model, the
class of ODS designs and associated CD analyses, and two multiple
imputation (MI) extensions for conducting analyses.

2.1. Linear mixed effects model for continuous longitudinal response data.
With $N$ subjects in the original cohort, $Y_i$, $i \in \{1, 2, \ldots, N\}$,
the $n_i$-vector of response values, $X_i$ an $n_i \times p$ fixed effects
design matrix, and $Z_i$ the $n_i \times q$ design matrix for the random
effects, we begin with the Laird and Ware (1982) linear mixed effects model
given by

(2.1) $Y_i = X_i \beta + Z_i b_i + \varepsilon_i$,

where $\beta$ is a $p$-vector of fixed-effect coefficients,
$b_i \sim N(0, D)$, and $\varepsilon_i \sim N(0, \Sigma)$. A common design
matrix for the random effects in the continuous data setting is
$Z_i = (1, T_i)$, where $T_i$ is a time-varying covariate (perhaps time
itself), $b_i = (b_{0i}, b_{1i})$, and $D$ is the $2 \times 2$ covariance
matrix containing variance components $(\sigma_0^2, \sigma_1^2)$ and
correlation $\rho = \mathrm{corr}(b_{0i}, b_{1i})$. Analysis based on a
random sample of $N_s$ subjects can be conducted by maximizing the
log-likelihood

(2.2) $l(\theta; Y, X) = \sum_{i=1}^{N_s} l_i(\theta; Y_i, X_i) = \sum_{i=1}^{N_s} \log f(Y_i \mid X_i; \theta)$,

where $\theta = (\beta, \sigma_0, \sigma_1, \rho)$ and $f(\cdot)$ is the multivariate normal density function.
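
As an illustration of how the log-likelihood in (2.2) can be evaluated, the following is a minimal Python sketch. It assumes that the error covariance is $\Sigma = \sigma_e^2 I$ (an assumption made here for concreteness), and the function names `lmm_loglik_subject` and `lmm_loglik` are hypothetical, not part of any software used in the paper.

```python
import numpy as np

def lmm_loglik_subject(y, X, Z, beta, D, sigma_e):
    """Marginal log-density of one subject's response vector under (2.1).

    Assumes independent errors with common variance sigma_e**2
    (an illustrative assumption).
    """
    n = len(y)
    V = Z @ D @ Z.T + sigma_e**2 * np.eye(n)   # marginal covariance of Y_i
    r = y - X @ beta                           # residual from the fixed effects
    _, logdet = np.linalg.slogdet(V)           # log-determinant of V
    quad = r @ np.linalg.solve(V, r)           # r' V^{-1} r
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

def lmm_loglik(ys, Xs, Zs, beta, D, sigma_e):
    """Log-likelihood (2.2): sum of the subject-level contributions."""
    return sum(lmm_loglik_subject(y, X, Z, beta, D, sigma_e)
               for y, X, Z in zip(ys, Xs, Zs))
```

In practice this function would be handed to a numerical optimizer over $\theta$; the sketch only evaluates the objective.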

2.2. Coarsened summary sampling designs. The study designs of
Schildcrout, Garbett and Heagerty (2013) subsample from a larger
cohort based on a user-defined, low-dimensional summary of the outcome


BIASED SAMPLING DESIGNS TO IMPROVE RESEARCH EFFICIENCY 5 


vector $Y_i$ or, more accurately, on strata defined by the summary measure.
Let $X_{oi}$ be a covariate subset of $X_i$ that is known prior to
initiation of the substudy and let $Q_i = g(Y_i, X_{oi})$ be any function of
the response and observed covariates that summarizes important features of
the response vectors. Three simple and useful summaries are the estimated
intercept, slope, and the joint intercept and slope, based on the
subject-specific OLS regression of $Y_i$ on a time-varying covariate. For
example, if $T_i$ is the easily ascertained time-varying covariate,
$X_{ti} = (1, T_i) \subset X_{oi}$, and
$W_{oi} = (X_{ti}^t X_{ti})^{-1} X_{ti}^t$, then $Q_i = W_{oi} Y_i$ is the
estimated intercept and slope for the regression of $Y_i$ on $T_i$. We
proposed stratified random sampling based on regions of $Q_i$. Based on
results from other literature [e.g., Zhou et al. (2002, 2007, 2011)], we
oversampled the extremes of the $Q_i$ distribution to realize substantial
efficiency gains for target parameters. Let $S_i$ equal 1 if subject $i$ is
sampled for exposure ascertainment and 0 if not. For region
$R^k \in \{R^1, \ldots, R^K\}$, let
$\pi(R^k) = \mathrm{pr}(S_i = 1 \mid Y_i, X_i) = \mathrm{pr}(S_i = 1 \mid q_i \in R^k)$
be the probability of being sampled given $q_i$, the observed value of
$Q_i$, is in region $k$. Importantly, $S_i \perp (Y_i, X_i) \mid q_i$, that
is, sampling depends upon the data $(Y_i, X_i)$ only through $q_i$.
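
The summary-based sampling described above can be sketched in a few lines of Python. The sketch assumes balanced data with a common time vector, and the helper names `ols_summaries` and `sample_by_region` are hypothetical; it is meant only to make the roles of $W_{oi}$, $Q_i$ and $\pi(R^k)$ concrete.

```python
import numpy as np

def ols_summaries(Y, t):
    """Subject-specific OLS intercepts and slopes, Q_i = W Y_i.

    Y has one row per subject; t is the common time vector
    (balanced data are assumed for simplicity).
    """
    Xt = np.column_stack([np.ones_like(t), t])
    W = np.linalg.solve(Xt.T @ Xt, Xt.T)      # W = (Xt' Xt)^{-1} Xt'
    return Y @ W.T                            # column 0: intercept, column 1: slope

def sample_by_region(q, cutpoints, probs, rng):
    """Stratified sampling on a scalar summary q.

    Regions are the intervals defined by the sorted cutpoints, and
    probs[k] is the investigator-chosen sampling probability pi(R^k).
    Returns the sampling indicators S_i and the region labels.
    """
    region = np.searchsorted(cutpoints, q)    # region index 0..len(cutpoints)
    pi = np.asarray(probs)[region]
    S = (rng.random(len(q)) < pi).astype(int)
    return S, region
```

For example, sampling on the slope would pass `ols_summaries(Y, t)[:, 1]` as `q`, with cutpoints at chosen percentiles of that summary.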


2.3. An ascertainment corrected likelihood for coarsened summary sampling
designs. For inferences to the population represented by the original
cohort (as opposed to the pseudo-population represented by the biased
sample), Schildcrout, Garbett and Heagerty (2013) considered maximization
of an ascertainment corrected likelihood (ACL). The ACL corrects for the
design by conditioning the likelihood on inclusion into the ODS
($S_i = 1$). It is a "complete data" (CD) likelihood [Carroll et al. (2006),
Lawless, Kalbfleisch and Wild (1999)] in that only subjects with complete
exposure data contribute to the conditional likelihood, and therefore to
the analysis. A key attraction of the CD approach is that valid inferences
can be realized while only requiring a model for $Y_i \mid X_i$ without
requiring a model for $X_i$. Specifically, if $f(Y_i \mid X_i; \theta)$ is
the density for subject $i$ under simple random sampling from a population,
the density for those who are included in the ODS is given by


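(2.3) $f(Y_i \mid X_i, S_i = 1; \theta) = \pi(q_i) f(Y_i \mid X_i; \theta) \{\mathrm{pr}(S_i = 1 \mid X_i; \theta)\}^{-1}$
      $= \pi(q_i) f(Y_i \mid X_i; \theta) \left\{ \sum_{k=1}^{K} \pi(R^k) \int_{R^k} f(q_i \mid X_i; \theta) \, dq_i \right\}^{-1}$,

where $\pi(q_i)$ is subject $i$'s sampling probability that is based on $q_i$
[i.e., $\pi(q_i) = \pi(R^k)$ if and only if $q_i \in R^k$], $\pi(R^k)$ is the
sampling probability for all values of $Q_i$ in region $R^k$, and
$\int_{R^k} f(q_i \mid X_i; \theta) \, dq_i = \mathrm{pr}(q_i \in R^k \mid X_i; \theta)$.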
Because $\pi(q_i)$ is parameter-free, being specified by the investigator,
if a total of $N_s$ subjects are selected into the ODS for exposure
ascertainment, the ascertainment corrected log-likelihood,
$l^C(\theta; Y, X)$, is given by

(2.4) $l^C(\theta; Y, X) = \sum_{i=1}^{N_s} \left[ l_i(\theta; Y_i, X_i) - \log \left\{ \sum_{k=1}^{K} \pi(R^k) \int_{R^k} f(q_i \mid X_i; \theta) \, dq_i \right\} \right]$.


In the special case where $Q_i = W_{oi} Y_i$ is a linear transformation of
$Y_i$, under the assumption $Y_i \mid X_i \sim N(\mu_i, V_i)$, then
$Q_i \mid X_i \sim N(\mu_{qi}, V_{qi})$, where $\mu_{qi} = W_{oi} \mu_i$ and
$V_{qi} = W_{oi} V_i W_{oi}^t$. Thus, the ACL is a straightforward
extension of the likelihood used for standard analyses, and details can be
found in Schildcrout, Garbett and Heagerty (2013). We note that this
log-likelihood is composed of two terms: the standard log-likelihood as in
equation (2.2) and an additive ascertainment correction piece that accounts
for the biased study design through the probability of being sampled as a
function of $X_i$. This is in contrast to inverse probability weighting or
weighted likelihood approaches [e.g., Horvitz and Thompson (1952), Robins,
Rotnitzky and Zhao (1994)] that multiply the log-likelihood by a function
of the sampling probability to calculate an unbiased estimating equation.
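
To make the ascertainment correction concrete, the sketch below evaluates $\mathrm{pr}(S_i = 1 \mid X_i; \theta)$ for a scalar summary (intercept or slope sampling), using the normality of $Q_i \mid X_i$ noted above. It is a simplified illustration under that assumption; the function names are hypothetical, and the bivariate (intercept and slope) case would require rectangle probabilities of a bivariate normal instead.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sampling_prob_given_x(mu_q, sd_q, cutpoints, probs):
    """pr(S_i = 1 | X_i; theta) = sum_k pi(R^k) * pr(q_i in R^k | X_i; theta)

    for a scalar summary with Q_i | X_i ~ N(mu_q, sd_q**2). Regions are the
    intervals between consecutive (sorted) cutpoints.
    """
    edges = [-math.inf] + list(cutpoints) + [math.inf]
    total = 0.0
    for k, pi_k in enumerate(probs):
        lo = norm_cdf((edges[k] - mu_q) / sd_q)
        hi = norm_cdf((edges[k + 1] - mu_q) / sd_q)
        total += pi_k * (hi - lo)
    return total

def acl_contribution(loglik_i, mu_q, sd_q, cutpoints, probs):
    """One subject's term in the ascertainment corrected log-likelihood:
    the standard log-likelihood minus the log sampling probability."""
    return loglik_i - math.log(sampling_prob_given_x(mu_q, sd_q, cutpoints, probs))
```

When every region has the same sampling probability the correction is a constant, and the ACL reduces to the standard log-likelihood up to that constant.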

2.4. Multiple imputation. Whereas the analysis procedures proposed in
Schildcrout, Garbett and Heagerty (2013) were more efficient than random
sampling, one can expect that there may be additional information in those
subjects for whom the unmeasured, expensive exposure, $X_{ei}$, was not
ascertained (i.e., those with $S_i = 0$). We therefore propose to multiply
impute [Rubin (1976)] $X_{ei}$ for all subjects in whom $S_i = 0$. Multiple
imputation (MI) is expected to recover some of the information about the
parameter associated with $X_{ei}$ that is lost by not measuring $X_{ei}$,
and it is expected to recover much more of the information in parameters
associated with $X_{oi}$ that is available but is not used in CD analyses.
Multiple imputation is attractive because it can leverage existing methods
and software without needing tailored programs. In the approaches described
below, we generate imputation samples from the conditional exposure
distribution in unsampled subjects $[X_{ei} \mid Y_i, X_{oi}, S_i = 0]$.
Once the exposure model is constructed, we build $M$ multiple imputation
data sets, fit the target model to each one using standard maximum
likelihood, and combine estimates across imputations to make inferences
regarding model parameters. For any scalar parameter $\vartheta$ in
$\theta$, we may estimate its value and variance with
$\hat{\vartheta} = M^{-1} \sum_{m=1}^{M} \hat{\vartheta}^{(m)}$ and
$\widehat{\mathrm{Var}}(\hat{\vartheta}) = W + (1 + M^{-1}) B$,
respectively, where
$W = M^{-1} \sum_{m=1}^{M} \widehat{\mathrm{Var}}(\hat{\vartheta}^{(m)})$ and
$B = (M - 1)^{-1} \sum_{m=1}^{M} (\hat{\vartheta}^{(m)} - \hat{\vartheta})^2$.
With adequate $M$, test statistics for parameters are well approximated by
a standard Gaussian distribution; however, with small $M$, a
$t$-distribution with $df = (M - 1)[1 + MW/\{(M + 1)B\}]^2$ degrees of
freedom is more appropriate [Rubin (1976), Little and Rubin (2002), Schafer
and Graham (2002)]. In the settings where we believe our designs could be
most useful, $X_{ei}$ is to be imputed in a relatively large percentage of
subjects (i.e., well over 50 percent), and in such cases a larger number of
imputation samples are required to use the normal approximation to the
$t$-distribution.
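
The combining rules above are simple enough to state as code. The following Python sketch, with the hypothetical function name `rubin_combine`, takes the $M$ point estimates and their estimated variances for a single scalar parameter and returns the pooled estimate, its total variance, and the degrees of freedom.

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Pool M imputation-specific results for one scalar parameter.

    estimates: the M point estimates
    variances: the M estimated variances (squared standard errors)
    Returns (pooled estimate, total variance, degrees of freedom).
    """
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    M = len(est)
    pooled = est.mean()
    W = var.mean()                    # average within-imputation variance
    B = est.var(ddof=1)               # between-imputation variance
    total_var = W + (1 + 1 / M) * B
    # With no between-imputation variability the t reference is a normal.
    df = (M - 1) * (1 + M * W / ((M + 1) * B)) ** 2 if B > 0 else np.inf
    return pooled, total_var, df
```

Inspecting `df` across parameters is one way to judge whether $M$ is large enough for the normal approximation.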

We now describe two approaches to estimating the imputation model
$[X_{ei} \mid y_i, x_{oi}, S_i = 0]$. The first is an extension of the CD
analysis described in Section 2.3 and the second is a direct imputation
approach that does not require estimation based on maximizing the ACL.
Because the ODS sampling schemes we have described depend upon the data
through a low-dimensional response summary and possibly observed covariates
$X_{oi}$,

(2.5) $\mathrm{pr}(x_{ei} \mid x_{oi}, y_i, S_i = 0) = \mathrm{pr}(x_{ei} \mid x_{oi}, y_i) = \mathrm{pr}(x_{ei} \mid x_{oi}, y_i, S_i = 1)$.

Thus, the design-based "missing data mechanism" is ignorable and generating
$X_{ei}$ for unsampled subjects can be based directly on model estimates
derived from sampled data without consideration of the biased sample.
Importantly, for the CAMP analysis, the missing exposure variable
($X_{ei}$) was binary and so for the present research, we only detail this
special case explicitly; however, extensions to continuous and other
exposure values are feasible.

2.4.1. Imputation model construction: Combine response model and marginal
exposure model. The complete data plus multiple imputation analysis
approach (CD+MI) combines the estimates from maximizing the ACL in
Section 2.3 with an exposure model for $[X_{ei} \mid x_{oi}, S_i = 1]$ to
estimate $[X_{ei} \mid y_i, x_{oi}, S_i = 0]$. Specifically, we combine a CD
estimate of $[Y_i \mid x_i, S_i = 1]$ with a covariate logistic regression
for $[X_{ei} \mid x_{oi}, S_i = 1]$ to identify the conditional exposure
distribution $[X_{ei} \mid y_i, x_{oi}, S_i = 1]$ used for imputation among
those with $S_i = 0$. Using equation (2.5) and Bayes' theorem,

(2.6) $\frac{\mathrm{pr}(X_{ei} = 1 \mid x_{oi}, y_i, S_i = 0)}{\mathrm{pr}(X_{ei} = 0 \mid x_{oi}, y_i, S_i = 0)} = \frac{f(y_i \mid X_{ei} = 1, x_{oi}, S_i = 1)}{f(y_i \mid X_{ei} = 0, x_{oi}, S_i = 1)} \cdot \frac{\mathrm{pr}(X_{ei} = 1 \mid x_{oi}, S_i = 1)}{\mathrm{pr}(X_{ei} = 0 \mid x_{oi}, S_i = 1)}$.

Using the logistic regression model to obtain the estimate
$\widehat{\mathrm{pr}}(x_{ei} \mid x_{oi}, S_i = 1)$ in the observed
subjects' data, and then combining it with
$f(y_i \mid x_{ei}, x_{oi}, S_i = 1)$ from the CD analysis, we are able to
estimate and sample from
$\widehat{\mathrm{pr}}(x_{ei} \mid x_{oi}, y_i, S_i = 0)$.
Note. We may write the exposure odds model itself as

(2.7) $\frac{\mathrm{pr}(X_{ei} = 1 \mid x_{oi}, S_i = 1)}{\mathrm{pr}(X_{ei} = 0 \mid x_{oi}, S_i = 1)} = \frac{\mathrm{pr}(S_i = 1 \mid X_{ei} = 1, x_{oi})}{\mathrm{pr}(S_i = 1 \mid X_{ei} = 0, x_{oi})} \cdot \frac{\mathrm{pr}(X_{ei} = 1 \mid x_{oi})}{\mathrm{pr}(X_{ei} = 0 \mid x_{oi})}$.

The first term on the right side of the equation is a ratio of the
ascertainment corrections for $X_{ei} = 1$ and $X_{ei} = 0$ that is shown in
equation (2.3). We can therefore use the log of the ratio of ascertainment
corrections as an offset in a logistic regression, marginal exposure model
given by (2.7). In some cases, such an approach may be more natural or
simple than modeling the exposure model on the left side of equation (2.7)
directly. This is due to the fact that the marginal exposure model,
$\mathrm{pr}(X_{ei} \mid X_{oi})$, may be simpler in the population as
compared to the observed sample, $\mathrm{pr}(X_{ei} \mid X_{oi}, S_i = 1)$.
For example, in many realistic scenarios, one would expect that
time-varying and time-invariant covariates are independent in the
population. In the CAMP, time since randomization is expected to be
independent of, say, genotype. However, for the biased sample, such
time-varying covariates may be spuriously associated with genotype due to
their impact on the probability of being sampled. If one wished to model
the left-hand side of equation (2.7) directly, the functional forms of
time-varying covariates must be carefully considered.

The steps for creating the imputation data sets used in the CD+MI approach
are as follows:

(1) On sampled subjects, $S_i = 1$, maximize the ascertainment corrected
log-likelihood shown in equation (2.4) to obtain estimates $\hat{\theta}$
and uncertainty $\widehat{\mathrm{Cov}}(\hat{\theta})$ associated with the
response model.

(2) For $m = 1, \ldots, M$, draw $\theta^{(m)}$ from the approximate
posterior distribution for $\theta$ given by the normalized likelihood
function, and calculate
(2a) $f(y_i \mid X_{ei} = 1, x_{oi}, S_i = 1; \theta^{(m)}) \{f(y_i \mid X_{ei} = 0, x_{oi}, S_i = 1; \theta^{(m)})\}^{-1}$,
(2b) $\log[\mathrm{pr}(S_i = 1 \mid X_{ei} = 1, x_{oi}; \theta^{(m)}) \{\mathrm{pr}(S_i = 1 \mid X_{ei} = 0, x_{oi}; \theta^{(m)})\}^{-1}]$.

(3) On sampled subjects, using (2b) as an offset, fit a logistic regression
of $X_{ei}$ on $X_{oi}$ to obtain parameter (call it $\alpha$) and
uncertainty estimates for the marginal exposure model shown in equation
(2.7). Then, draw $\alpha^{(m)}$ from a
$N[\hat{\alpha}, \widehat{\mathrm{Cov}}(\hat{\alpha})]$ and calculate
(3a) $\mathrm{pr}(X_{ei} = 1 \mid x_{oi}, S_i = 1; \alpha^{(m)}) \{\mathrm{pr}(X_{ei} = 0 \mid x_{oi}, S_i = 1; \alpha^{(m)})\}^{-1}$.

(4) For unsampled subjects, multiply the results of (2a) and (3a) to
calculate the conditional exposure odds in equation (2.6) and then draw
imputed values, $X_{ei}^{(m)}$.

(5) Conduct standard maximum likelihood analysis on the response model
using the complete imputation data set.

(6) Repeat steps (2)-(5) $M$ times and combine results in the standard
manner.
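
Step (4) of the procedure above can be sketched as follows. This is an illustrative Python fragment only: it assumes the density ratio from (2a) and the exposure odds from (3a) have already been computed for each unsampled subject, and the function name `impute_binary_exposure` is hypothetical.

```python
import numpy as np

def impute_binary_exposure(density_ratio, exposure_odds, rng):
    """Draw imputed binary exposures for unsampled subjects.

    density_ratio: result of (2a) for each unsampled subject
    exposure_odds: result of (3a) for each unsampled subject
    Their product is the conditional exposure odds in equation (2.6).
    Returns (imputed values, conditional exposure probabilities).
    """
    odds = np.asarray(density_ratio, dtype=float) * np.asarray(exposure_odds, dtype=float)
    prob = odds / (1.0 + odds)                        # odds -> probability
    draws = (rng.random(prob.shape) < prob).astype(int)
    return draws, prob
```

Calling this once per imputation, with ratios and odds recomputed from the draws $\theta^{(m)}$ and $\alpha^{(m)}$, propagates parameter uncertainty into the imputed values.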

To the extent that the assumptions of the response and marginal exposure
models are correct, the foregoing CD+MI approach is expected to be valid
and relatively efficient compared to the CD approach. It is worth noting
that the imputation model for the CD+MI approach is a general location
model that is discussed in, for example, Little and Schluchter (1985),
Schafer (2010), and Little and Rubin (2002).


2.4.2. Imputation model construction: Direct conditional exposure model.
Another approach to constructing the imputation model is relatively direct
and could employ available MI software. In contrast to CD+MI, it decouples
the imputation and the analysis models. We refer to it as direct multiple
imputation (D-MI) and it is a special case of multiple imputation by
chained equations [e.g., Raghunathan et al. (2001), White, Royston and
Wood (2011)] which is implemented in software packages such as MICE
[Van Buuren (2012)] in the R programming language [R Core Team (2013)].
We may ascertain and sample from $[X_{ei} \mid y_i, x_{oi}, S_i = 0]$
directly by noting that the conditional exposure odds model on the
left-hand side of equation (2.6) can be constructed using logistic
regression analysis with any functions of $Y_i$ and $X_{oi}$ as independent
variables. Since $X_{ei} \perp S_i \mid (Y_i, X_{oi})$ by design, then if the
Gaussian linear mixed model assumptions are satisfied, the induced
conditional exposure log-odds from equation (2.6) can be written

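(2.8) $-\frac{1}{2}\{(Y_i - \mu_{1,i})^t V_{1,i}^{-1} (Y_i - \mu_{1,i}) - (Y_i - \mu_{0,i})^t V_{0,i}^{-1} (Y_i - \mu_{0,i})\} - \frac{1}{2} \log\left(\frac{|V_{1,i}|}{|V_{0,i}|}\right) + \log\left\{\frac{\mathrm{pr}(X_{ei} = 1 \mid X_{oi})}{\mathrm{pr}(X_{ei} = 0 \mid X_{oi})}\right\}$,

where $\mu_{x,i} = E(Y_i \mid X_{ei} = x, X_{oi})$ and
$V_{x,i} = \mathrm{Var}(Y_i \mid X_{ei} = x, X_{oi}) = Z_i D_{x,i} Z_i^t + \sigma_e^2 I$.
If we assume homoscedasticity, then $V_{1,i} = V_{0,i} = V_i$, and equation
(2.8) simplifies to

(2.9) $\underbrace{Y_i^t V_i^{-1} (\mu_{1,i} - \mu_{0,i})}_{(a)} - \underbrace{\tfrac{1}{2}(\mu_{1,i}^t V_i^{-1} \mu_{1,i} - \mu_{0,i}^t V_i^{-1} \mu_{0,i})}_{(b)} + \underbrace{\log\left\{\frac{\mathrm{pr}(X_{ei} = 1 \mid X_{oi})}{\mathrm{pr}(X_{ei} = 0 \mid X_{oi})}\right\}}_{(c)}$.
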


In Supplement A [Schildcrout et al. (2015)], we detail further
simplifications with balanced and complete data that we examine in
Section 3 and that are motivated by the CAMP analysis whose design was
nearly balanced and complete. Briefly, with balanced and complete data, if
$v_{i,jk}$ is the $(j, k)$th element of $V_i^{-1}$ and $\omega_{ij}(t_{ij})$
is the $X_{ei}$ effect at time $t_{ij}$ (i.e., $\mu_{1,ij} - \mu_{0,ij}$),
then (a) equals
$\sum_{j=1}^{n_i} \sum_{k=1}^{n_i} v_{i,jk} \cdot y_{ij} \cdot \omega_{ik}(t_{ik})$,
(b) equals
$\sum_{j=1}^{n_i} \sum_{k=1}^{n_i} v_{i,jk} \cdot \mu_{0,ij} \cdot \omega_{ik}(t_{ik}) + \frac{1}{2} \sum_{j=1}^{n_i} \sum_{k=1}^{n_i} v_{i,jk} \cdot \omega_{ij}(t_{ij}) \cdot \omega_{ik}(t_{ik})$,
and (c) contains terms involving $X_{oi}$ useful for predicting $X_{ei}$.
Our approach to imputation is then to directly model
$[X_{ei} \mid X_{oi}, Y_i]$ with logistic regression and to follow standard
multiple imputation methods. We note that the first two terms in equations
(2.8) and (2.9), respectively, result in the functional form of quadratic
and linear discriminant analysis [Fisher (1936)] that are used in many
classification analyses.
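
The reduction from the quadratic form in (2.8) to the form that is linear in $Y_i$ under a common covariance matrix can be checked numerically. The Python sketch below is offered only as a sanity check of that algebra; the function names are hypothetical and the inputs are arbitrary.

```python
import numpy as np

def logodds_quadratic(y, mu1, mu0, V1, V0, prior_logodds):
    """Conditional exposure log-odds in the general form of equation (2.8)."""
    r1, r0 = y - mu1, y - mu0
    quad = r1 @ np.linalg.solve(V1, r1) - r0 @ np.linalg.solve(V0, r0)
    logdet = np.linalg.slogdet(V1)[1] - np.linalg.slogdet(V0)[1]
    return -0.5 * quad - 0.5 * logdet + prior_logodds

def logodds_linear(y, mu1, mu0, V, prior_logodds):
    """Homoscedastic simplification: a term linear in y, minus a term free
    of y, plus the marginal exposure log-odds."""
    a = y @ np.linalg.solve(V, mu1 - mu0)
    b = 0.5 * (mu1 @ np.linalg.solve(V, mu1) - mu0 @ np.linalg.solve(V, mu0))
    return a - b + prior_logodds
```

With `V1` and `V0` set to the same matrix, the two functions agree to numerical precision, which is why a logistic regression on linear functions of the responses suffices under homoscedasticity.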

3. Finite sampling operating characteristics. The key motivator in outcome
dependent sampling schemes is to obtain nearly efficient inference at
considerable cost savings by drawing and analyzing small to modest sample
sizes. Indeed, the CAMP study could have realized considerable savings if
it had only analyzed 250 genotypes, versus more than 500. As such, it is
critical in application of these design strategies to quantify the degree
to which theoretical results are realized in finite sample settings.
Schildcrout, Garbett and Heagerty (2013) conducted such simulations, which
are briefly summarized in the Introduction. We now examine the CD+MI and
D-MI estimation procedures proposed in Section 2 to explore: (1) the
scenarios under which MI does and does not improve estimation efficiency
over a CD analysis; and (2) the extent to which the study design continues
to improve efficiency if MI is the intended analytical strategy.

3.1. Population model. We conducted simulation studies under several
study designs and population features motivated by the CAMP study and
by studies with similarly-balanced longitudinal follow-up. Results presented
here summarize 1000 replications per scenario. In each scenario, we
generated a cohort of $N$ subjects based on the model

$Y_{ij} = \beta_0 + \beta_t t_{ij} + \beta_g G_i + \beta_{gt} G_i t_{ij} + \beta_c C_i + b_{0i} + b_{1i} t_{ij} + \varepsilon_{ij}$,

with $i \in \{1, 2, \ldots, N\}$ denoting subject, $j \in \{1, 2, \ldots, 10\}$
denoting observation within subject, $t_{ij}$ an equally spaced, balanced
time covariate ranging from $-2$ to 2, $C_i$ a binary, time-invariant
covariate with $\mathrm{pr}(C_i = 1) = 0.5$, $G_i$ an expensive, binary
"group" or "genotype" variable with
$\mathrm{pr}(G_i = 1 \mid C_i = c) = 0.4 + \delta_c c$, $(b_{0i}, b_{1i})$
the random intercept and slope, and $\varepsilon_{ij}$ the measurement
error. Across all scenarios, $(\beta_0, \beta_t, \beta_{gt}) = (5, 1.0, 0.75)$,
the means of the random effects and error distributions were 0, and the
standard deviations of the random intercept, the random slope and the
measurement error were $\sigma_0 = 5$, $\sigma_1 = 1.25$ and $\sigma_e = 5$,
respectively. Additionally, $\rho = \mathrm{corr}(b_{0i}, b_{1i}) = -0.25$.
We examined the relative efficiency of the designs and estimation
procedures as a function of the following: the $G_i$ effect size,
$\beta_g \in \{-2.5, -4.0\}$, the strength of the $C_i \sim G_i$
relationship, $\delta_c \in \{0.15, 0.35, 0.55\}$, the sample size of the
original cohort $N \in \{750, 2250\}$, and the impact of $C_i$ being a proxy
for $G_i$ as opposed to being a confounder for the $G_i \sim Y_i$
relationship. In the last scenario, $\beta_c = 0$ and $C_i$ is used to
impute $G_i$ but is not included in the primary analysis model. In all
other scenarios, $\beta_c = 1$ and $C_i$ is included as an independent
variable. Specifically, we examine five distinct scenarios uniquely
identified by $(N, \beta_g, \delta_c, \beta_c)$. Scenarios studied are
given by the following: (a) (750, -2.5, 0.15, 1.0), (b) (750, -4.0, 0.15,
1.0), (c) (750, -2.5, 0.35, 1.0), (d) (2250, -2.5, 0.15, 1.0), and
(e) (750, -2.5, 0.55, 0.0).
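
For readers who wish to reproduce a cohort of this form, a minimal Python sketch of the data-generating model follows. The default parameter values are those stated in the text as we read them; the function name `simulate_cohort` and its interface are hypothetical, and the sketch is not the simulation code used for the reported results.

```python
import numpy as np

def simulate_cohort(N, beta_g, delta_c, beta_c, rng,
                    beta_0=5.0, beta_t=1.0, beta_gt=0.75,
                    sigma_0=5.0, sigma_1=1.25, sigma_e=5.0, rho=-0.25):
    """Generate one balanced cohort with 10 observations per subject.

    Returns (Y, t, G, C): an N x 10 response matrix, the common time
    vector, the expensive binary covariate G and the inexpensive binary
    covariate C.
    """
    t = np.linspace(-2, 2, 10)                       # equally spaced times
    C = rng.binomial(1, 0.5, N)                      # pr(C = 1) = 0.5
    G = rng.binomial(1, 0.4 + delta_c * C)           # pr(G = 1 | C = c)
    cov01 = rho * sigma_0 * sigma_1
    D = np.array([[sigma_0**2, cov01], [cov01, sigma_1**2]])
    b = rng.multivariate_normal(np.zeros(2), D, size=N)   # random effects
    eps = rng.normal(0.0, sigma_e, size=(N, 10))           # measurement error
    mean = ((beta_0 + beta_g * G + beta_c * C)[:, None]
            + (beta_t + beta_gt * G)[:, None] * t)
    Y = mean + b[:, [0]] + b[:, [1]] * t + eps
    return Y, t, G, C
```

For instance, scenario (a) would correspond to `simulate_cohort(750, -2.5, 0.15, 1.0, np.random.default_rng(seed))` for some chosen `seed`.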

3.2. Study designs. The substudies we sought to examine were those
that sampled, on average, 250 subjects for whom $G_i$ should be ascertained,
again motivated by the CAMP framework. For the random sampling (RS)
design, we took a simple random sample of 250 subjects at each replication.
For ODS designs based on the intercept (ods.i), slope (ods.s), and bivariate
intercept and slope (ods.b), we calculated subject-specific intercepts and
slopes based on the $N$ separate OLS regressions of the response $Y_{ij}$ on
time $t_{ij}$, and sampled subject $i$ with probability that depended upon
the region in which $Q_i$ was located. For ods.i and ods.s we split the
distribution of the sampling variable $Q_i$ into three regions defined by
the 12th and 88th percentiles of the population distribution. We then
sampled individuals with probability
$\pi(q_i) = \mathrm{pr}(S_i = 1 \mid Q_i = q_i)$ so that, on average, 90
subjects from each of the two outlying regions and 70 subjects from the
central region were included in the outcome dependent sample. Similarly for
the ods.b design, we sampled with probability so that 70 subjects were
included from the central rectangular region that contained 76 percent of
the population and 180 subjects were included from the outlying region
containing 24 percent of the population. See Schildcrout, Garbett and
Heagerty (2013) for a description and a figure describing these sampling
schemes.
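
The region-specific sampling probabilities implied by these targets follow from simple arithmetic: the expected number sampled from a region is its sampling probability times the expected number of cohort members in the region. A short Python sketch, with the hypothetical helper `region_sampling_probs`, is given below.

```python
def region_sampling_probs(N, region_fractions, target_counts):
    """Sampling probability per region so that, on average, the target
    number of subjects is drawn from each region.

    region_fractions: population fraction falling in each region
    target_counts: expected number of subjects to sample per region
    Probabilities are capped at 1.
    """
    return [min(1.0, n_k / (N * f_k))
            for f_k, n_k in zip(region_fractions, target_counts)]

# ods.i or ods.s with N = 750: regions hold about 90, 570 and 90 subjects,
# so both tails are sampled with probability 1 and the center with 70/570.
probs_750 = region_sampling_probs(750, [0.12, 0.76, 0.12], [90, 70, 90])

# With N = 2250 the tails hold about 270 subjects each, giving 90/270.
probs_2250 = region_sampling_probs(2250, [0.12, 0.76, 0.12], [90, 70, 90])
```

Under these targets, a cohort of 750 leads to essentially all subjects in the outlying regions being sampled, whereas a cohort of 2250 leaves about two thirds of them unsampled.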

3.3. Analyses. After subsampling from the original cohort of $N$, we
conducted the CD analysis by fitting the model with maximum ascertainment
corrected likelihood under the ODS designs or with standard maximum
likelihood (ML) under the RS design. To conduct multiple imputation
analyses, we estimated the multiple imputation model for $G_i$ in
unsampled subjects, $\mathrm{pr}(G_i \mid y_i, c_i, S_i = 0)$, via the
approaches discussed in Sections 2.4.1 and 2.4.2. Specifically, the
imputation model for CD+MI analyses was estimated by combining the CD
analysis and the offsetted logistic regression analysis of $g_i$ on $c_i$
in sampled subjects. The imputation model for the D-MI approach was
estimated with a regression model of $G_i$ on predictors $\sum_j y_{ij}$,
$\sum_j y_{ij} \cdot t_{ij}$ and $c_i$ in sampled subjects. See the online
supplementary materials [Schildcrout et al. (2015)] for an explanation of
why these independent variables were used in the imputation model. The
number $M$ of imputation samples used was based on examination of the
degrees of freedom that were calculated as described in Section 2.4 and
with the intention of conducting sufficient imputation analyses so that the
$t$-statistics associated with all parameter estimates were well
approximated by normal distributions. When $N = 750$, $M = 25$ was used;
when $N = 2250$, $M = 35$.
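
A bare-bones version of the D-MI imputation model fit can be sketched in Python as follows. The sketch builds the predictors named above and fits a logistic regression by Newton-Raphson; the function names are hypothetical, no regularization or convergence safeguards are included, and a proper imputation would additionally draw the coefficients from their approximate posterior before imputing.

```python
import numpy as np

def dmi_predictors(Y, t, C):
    """Design matrix with an intercept, sum_j y_ij, sum_j y_ij * t_ij and C."""
    return np.column_stack([np.ones(len(Y)), Y.sum(axis=1), Y @ t, C])

def fit_logistic(X, g, n_iter=50, tol=1e-10):
    """Maximum likelihood logistic regression of g on X via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (g - p)                          # score vector
        hess = X.T @ (X * (p * (1 - p))[:, None])     # observed information
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def exposure_prob(X, beta):
    """Fitted pr(G = 1 | predictors), to be applied to unsampled subjects."""
    return 1.0 / (1.0 + np.exp(-(X @ beta)))
```

The model would be fit on sampled subjects only and then applied to the unsampled subjects' predictors to obtain the probabilities from which imputed values are drawn.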

3.4. Results. Because the models were properly specified, all estimation
procedures were observed to be approximately valid with observed biases
in parameter estimates less than 5% and observed biases in standard errors
less than 10% (not shown).

Table 1 shows the efficiency of each design and analysis procedure
combination relative to the RS design and standard CD maximum likelihood
analysis. Relative efficiency is defined as the empirical variance under RS
plus CD analyses divided by the empirical variance under each other design
and estimation procedure. Note that the CD+MI and D-MI approaches
perform similarly for nearly all parameter-by-scenario combinations. In
scenario (a) we observe that for $\beta_g$ and $\beta_{gt}$ the impact of
the study design far outweighs the impact of multiply imputing $G_i$. For
example, using CD analyses to estimate $\beta_{gt}$, the ods.s design
improves estimation efficiency by 87 percent over RS, but adding multiple
imputation to the CD analysis by using the CD+MI approach improves
efficiency only by an additional 7.4 percent (2.01/1.87 = 1.074). However,
if interest is in estimates of $\beta_c$, which corresponds to $C_i$, a
covariate that is available in everyone, the impact of multiple imputation
outweighs the study design. Notice that with both the CD+MI and D-MI
approaches all designs have a relative efficiency for $\beta_c$ of
approximately 2.6-2.7 compared to random sampling with CD analyses. For
estimates of $\beta_0$ and $\beta_t$, the study design and multiple
imputation-based analyses independently contributed to optimal estimation
efficiency.
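
The relative efficiency measure defined above is a ratio of empirical variances across replicates. A minimal Python sketch is given below; the function names are hypothetical and the inputs would be the vectors of parameter estimates obtained across simulation replicates.

```python
import numpy as np

def relative_efficiency(ref_estimates, other_estimates):
    """Empirical variance under the reference (RS design with CD analysis)
    divided by the empirical variance under another design and analysis.
    Values above 1 favor the other design and analysis."""
    return np.var(ref_estimates, ddof=1) / np.var(other_estimates, ddof=1)

def incremental_gain(re_with_mi, re_cd):
    """Proportional efficiency gain from adding MI to a CD analysis within
    the same design, computed from two relative efficiencies."""
    return re_with_mi / re_cd - 1.0
```

Because both relative efficiencies share the same reference variance, their ratio compares the two analyses directly within a design.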

Scenarios (b), (c), (d) and (e) provide some insight into how the results
shown in scenario (a) depend upon population data features. We used these
scenarios specifically to examine the extent to which MI adds to the optimal
study design with CD analyses and we now focus our discussion exclusively
on $\beta_g$ and $\beta_{gt}$. Comparing results from scenario (b) to (a),
we observed that the impact of MI is somewhat greater when the $G_i$ effect
size is larger. Whereas in scenario (b), when estimating $\beta_g$, CD+MI
was 20 percent more efficient than CD for the optimal ods.i design
(1.99/1.65 = 1.20), in scenario (a) it was only 4 percent more efficient
(2.20/2.11 = 1.04). As shown by comparing results from scenarios (c) and
(d) to (a), we observe that MI appears to add modest additional precision
to the optimal design when the $C_i \sim G_i$ relationship is stronger and
when the original cohort size is larger. Finally, in scenario (e) we
observed that when $C_i$ is a proxy for $G_i$ rather

Table 1

Relative efficiency: Results show ratios of the empirical variance of the RS design with standard CD analyses to the empirical variance
of all other study design and analysis procedures across 1000 replicates. The designs ods.i, ods.s and ods.b are ODS designs with
sampling based on the intercept, slope, and both intercept and slope of subject-specific ordinary least squares regression of $Y_{ij}$ on $t_{ij}$. For
each parameter we show values that correspond to CD, CD+MI and D-MI analyses, respectively. In scenario (e) we do not estimate
$\beta_c$, as $C_i$ is not included in the final model but is only used for $G_i$ imputation

Scenario  $(N, \beta_g, \delta_c, \beta_c)$   Design  $\beta_0$         $\beta_t$         $\beta_g$         $\beta_{gt}$             $\beta_c$

(a)       750, -2.5, 0.15, 1.0                RS      1.00, 1.88, 1.90  1.00, 1.68, 1.64  1.00, 1.02, 1.03  1.00, 1.13, 1.09         1.00, 2.66, 2.65
                                              ods.i   2.18, 2.63, 2.63  0.89, 1.37, 1.35  2.11, 2.20, 2.19  [illegible], 0.94, 0.92  1.99, 2.64, 2.64
                                              ods.s   1.02, 1.89, 1.90  2.01, 2.32, 2.27  1.00, 1.00, 1.02  1.87, 2.01, 1.96         1.03, 2.61, 2.62
                                              ods.b   1.82, 2.42, 2.41  1.64, 1.97, 1.97  1.75, 1.79, 1.82  1.52, 1.59, 1.59         1.72, 2.67, 2.65

(b)       750, -4.0, 0.15, 1.0                RS      1.00, 1.90, 1.92  1.00, 1.65, 1.67  1.00, 1.20, 1.21  1.00, 1.14, 1.16         1.00, 2.65, 2.59
                                              ods.i   1.79, 2.17, 2.14  1.02, 1.61, 1.57  1.65, 1.99, 1.96  1.01, 1.07, 1.04         1.83, 2.27, 2.20
                                              ods.s   1.01, 1.85, 1.83  2.30, 2.71, 2.74  0.91, 1.06, 1.05  2.19, 2.35, 2.36         1.00, 2.49, 2.48
                                              ods.b   1.57, 2.13, 2.10  2.03, 2.44, 2.42  1.43, 1.57, 1.57  1.85, 1.98, 1.93         1.79, 2.53, 2.46

(c)       750, -2.5, [illegible], 1.0         RS      1.00, 1.90, 1.91  1.00, 1.61, 1.51  1.00, 1.05, 1.04  1.00, 1.23, 1.15         1.00, 2.26, 2.27
                                              ods.i   2.03, 2.62, 2.62  1.00, 1.56, 1.48  1.95, 2.09, 2.12  0.96, 1.15, 1.08         1.90, 2.37, 2.39
                                              ods.s   1.13, 2.06, 2.05  2.10, 2.53, 2.53  1.01, 1.06, 1.07  2.07, 2.33, 2.33         1.00, 2.28, 2.28
                                              ods.b   1.89, 2.51, 2.51  1.88, 2.28, 2.24  1.71, 1.81, 1.78  1.84, 2.02, 1.95         1.67, 2.45, 2.41

(d)       2250, -2.5, 0.15, 1.0               RS      1.00, 2.97, 3.01  1.00, 2.03, 2.00  1.00, 1.07, 1.07  1.00, 1.14, 1.11         1.00, 5.83, 5.79
                                              ods.i   2.06, 4.69, 4.67  0.99, 1.97, 1.89  1.76, 2.01, 2.01  0.95, 1.11, 1.07         1.89, 5.75, 5.74
                                              ods.s   0.98, 2.85, 2.89  2.12, 3.75, 3.70  0.92, 0.95, 0.97  2.05, 2.44, 2.39         0.86, 5.61, 5.52
                                              ods.b   1.65, 3.98, 4.07  1.83, 3.25, 3.21  1.52, 1.57, 1.60  1.81, 2.02, 1.98         1.53, 5.76, 5.50

(e)       750, -2.5, 0.55, 0.0                RS      1.00, 1.71, 1.60  1.00, 1.79, 1.59  1.00, 1.50, 1.37  1.00, 1.52, 1.32
                                              ods.i   1.95, 2.33, 2.33  1.03, 1.64, 1.49  1.98, 2.29, 2.29  0.92, 1.39, 1.20
                                              ods.s   1.04, 1.65, 1.58  1.99, 2.36, 2.33  0.99, 1.46, 1.36  2.03, 2.37, 2.33
                                              ods.b   1.77, 2.16, 2.09  1.80, 2.21, 2.17  1.75, 2.06, 1.96  1.77, 2.17, 2.08

than a confounder and when the $G_i \sim C_i$ relationship is relatively strong
with $\delta_c = 0.55$, adding MI to the optimal design led to larger efficiency
gains for $\beta_g$ and $\beta_{gt}$. For example, the efficiency of CD+MI
relative to CD analyses for the optimal designs for $\beta_g$ and $\beta_{gt}$
was 2.29/1.98 = 1.16 and 2.37/2.03 = 1.17, respectively.
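The incremental gain from MI discussed above is simply the ratio of two Table 1 entries for the same design. As a small sketch (the dictionary and function names are illustrative; the numbers are the CD and CD+MI relative efficiencies for $\beta_g$ under the ods.i design, read from Table 1):

```python
# (CD, CD+MI) relative efficiencies for beta_g under the ods.i design,
# copied from Table 1 for scenarios (a)-(e).
BETA_G_ODS_I = {
    "a": (2.11, 2.20),
    "b": (1.65, 1.99),
    "c": (1.95, 2.09),
    "d": (1.76, 2.01),
    "e": (1.98, 2.29),
}


def mi_gain(cd, cd_mi):
    """Efficiency of CD+MI relative to CD for the same design."""
    return cd_mi / cd


gains = {
    scenario: mi_gain(cd, cd_mi)
    for scenario, (cd, cd_mi) in BETA_G_ODS_I.items()
}
```

Because the table entries are rounded to two decimals, ratios computed this way can differ from the quoted percentages in the last digit.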

Multiple imputation resulted in substantial efficiency improvements over
CD analysis for estimates of $(\beta_0, \beta_t, \beta_c)$, but had a far
smaller impact on estimation efficiency for $(\beta_g, \beta_{gt})$. Figure 1
shows the relative efficiency for estimating the mean value at the end of the
study period for those with $(G_i, C_i) = (1, 1)$,
$\mu_{i,10} = E(Y_{i,10} \mid G_i = 1, C_i = 1, t_{ij} = 2)$, under all scenarios.
By combining parameter estimates to obtain the linear predictor estimate we
observed that in all scenarios and for all study designs, CD+MI and D-MI
analyses are substantially more efficient than CD analyses. That is, MI
improved estimation efficiency dramatically, and the study design itself had a
more modest impact. However, we also note that the ods.b design is the most
efficient design in all scenarios for estimating the end-of-study mean value.
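To see why the end-of-study mean behaves more like $\beta_0$, $\beta_t$ and $\beta_c$ than like $\beta_g$ and $\beta_{gt}$, it helps to write the target as a linear combination of the parameters. The sketch below assumes the simulation mean model is linear in the five parameters reported in Table 1 (the model itself is defined earlier in the paper and is not restated here); $\hat V$ denotes the estimated covariance matrix of $\hat\beta$.

```latex
\begin{align*}
\mu_{i,10} &= \beta_0 + 2\beta_t + \beta_g + 2\beta_{gt} + \beta_c
            = c^\top \beta,
  \qquad c = (1, 2, 1, 2, 1)^\top, \\
\widehat{\mathrm{Var}}(\hat\mu_{i,10}) &= c^\top \hat V c
  = \sum_{j}\sum_{k} c_j c_k \hat V_{jk}.
\end{align*}
```

Under this assumption the variance of the predicted mean pools the uncertainty of all coefficients, including those for which MI yields large gains, which is consistent with the pattern described for Figure 1.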


Fig. 1. Relative efficiency for estimating the predicted value at the end of the study period
$\mu_{i,10} = E(Y_{i,10} \mid G_i = 1, C_i = 1, t_{ij} = 2)$ for all design and analysis procedure combinations
versus RS and standard CD analyses based on 1000 replications, shown in panels (a)-(e). Symbol ○ denotes CD
analyses, △ denotes CD+MI analyses, and + denotes D-MI analyses. Parameter values
(a)-(e) are given in Table 1.

Even though ods.b was not the optimal design for any single parameter
(see Table 1), it is reasonably efficient for all parameters, which is beneficial
if more than one parameter is of interest. In contrast, the ods.s and ods.i
designs were efficient for individual parameters but were inefficient for other
parameters.

4. CAMP data analysis. In this section we analyze the CAMP data using 
different subsampling designs both with and without imputation. Our goal 
is to empirically compare the research efficiency of candidate designs, and we 
have the complete data against which we can benchmark performance. Since 
our simulation study showed that CD+MI and D-MI approaches are similar, 
we focus our presentation on only one imputation approach (CD+MI). A 
total of 555 subjects had sufficient covariate and genotype data available, 
and we operate under the assumption that stored blood samples are available 
for all participants, although study resources only permit genotyping 250. 
Thus, approximately 250 genotypes are used at each of 30 replications of 
each study design. We report results based on the average estimates and 
(co)variances. Similar to the simulations, we consider four designs: random 
subsampling of 250 children (RS) and three ODS designs. To create the ODS
designs, we first compute all estimated intercepts and slopes from
subject-specific simple linear regressions of post-bronchodilator percent
predicted FVC (FVC%) on time since randomization. Sampling was then based on
the following: the estimated intercept (ods.i), the estimated slope (ods.s),
or the estimated intercept and slope jointly (ods.b). In order to obtain
approximately 250 subjects, the cutoff points that define strata in the ods.i
and ods.s designs are given by the 16th and 84th percentiles of the original
cohort. We sampled subjects at or below the 16th percentile or at or above the
84th percentile with probability 1, and subjects falling in the central
68% region with probability 0.19. For ods.b, we sampled with probability 0.19
all subjects who fell in the central 68% region of the joint intercept and
slope distribution in the original cohort and with probability 1 all of those
falling outside this region. Table 2 shows the characteristics of the CAMP
cohort from which we subsampled for the ODS studies.
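The univariate sampling rule for ods.i and ods.s can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: the function names, the interpolation rule for percentiles and the seed are assumptions, and the joint (ods.b) region is not covered because its exact definition is not given in this section.

```python
import random


def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def ods_sample(summary, low_q=16, high_q=84, p_central=0.19, seed=0):
    """Return indices of sampled subjects given one subject-specific summary
    (estimated intercept for ods.i, estimated slope for ods.s)."""
    rng = random.Random(seed)
    ordered = sorted(summary)
    lo_cut = percentile(ordered, low_q)
    hi_cut = percentile(ordered, high_q)
    selected = []
    for index, value in enumerate(summary):
        in_tail = value <= lo_cut or value >= hi_cut
        # Tails are always sampled; the central region is sampled at random.
        if in_tail or rng.random() < p_central:
            selected.append(index)
    return selected


def expected_sample_size(n, tail_fraction=0.32, p_central=0.19):
    """Expected number sampled: all of the tails plus a fraction of the center."""
    return n * (tail_fraction + (1.0 - tail_fraction) * p_central)
```

With the cohort size of 555, `expected_sample_size(555)` is 555 x (0.32 + 0.68 x 0.19), roughly 249, which matches the target of about 250 genotyped subjects.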

The primary scientific goals of the CAMP analysis are to examine the
treatment effects within subgroups defined by the presence or absence of a
variant allele (VA) on the fourth locus of the IL10 gene, and to examine
the difference in lung growth between those with and without a VA. Three-way
interactions (IL10 × medication × $t_{ij}$) were explored; however, we
report results only from two-way interactions. Thus, the fitted model for this
analysis was

$E[Y_{ij} \mid X_i] = \beta_0 + \beta_1 t_{ij} + \beta_2 \cdot \mathrm{bud}_i + \beta_3 \cdot \mathrm{ned}_i + \beta_4 \cdot \mathrm{IL10}_i + \beta_5 \cdot \mathrm{bud}_i \cdot \mathrm{IL10}_i$
$\qquad + \beta_6 \cdot \mathrm{ned}_i \cdot \mathrm{IL10}_i + \beta_7 \cdot t_{ij} \cdot \mathrm{IL10}_i + \beta_c \cdot \mathrm{covariates}_{ij}.$
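The subgroup summaries reported for this model are linear combinations of its coefficients. The sketch below shows that mapping under the assumption that the coefficients are indexed $\beta_1$ to $\beta_7$ in the order the terms appear in the fitted model; the function name and dictionary keys are illustrative, and the input values are the original-cohort "No VAs" and "Difference" entries of Table 3.

```python
def primary_summaries(b1, b2, b3, b4, b5, b6, b7, t_end=4.0):
    """Map mean-model coefficients to subgroup contrasts.

    b1: time, b2: budesonide, b3: nedocromil, b4: IL10 VA,
    b5: budesonide x IL10, b6: nedocromil x IL10, b7: time x IL10.
    """
    return {
        "bud_no_va": b2,
        "bud_va": b2 + b5,
        "bud_diff": b5,
        "ned_no_va": b3,
        "ned_va": b3 + b6,
        "ned_diff": b6,
        "time_no_va": b1,
        "time_va": b1 + b7,
        "time_diff": b7,
        "il10_placebo_t0": b4,
        "il10_placebo_end": b4 + b7 * t_end,
    }


# Original-cohort estimates read from Table 3.
summaries = primary_summaries(
    b1=0.14, b2=-2.11, b3=-0.77, b4=-1.65, b5=5.40, b6=1.46, b7=-0.39
)
```

These inputs reproduce the "With VAs" rows of Table 3 (3.29, 0.69 and -0.25) and give -3.21 for the IL10 contrast at year 4, which agrees with the tabulated value up to rounding.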
Table 2

Demographic and other characteristics of children participating
in the CAMP with genotype and covariate data available.
Continuous variables are summarized with the 10th, 50th and
90th percentiles, and categorical variables other than site are
summarized with proportions

Variable                                      Value

Cohort size (N)                               555
Albuquerque                                   41
Baltimore                                     71
Boston                                        72
Denver                                        64
San Diego                                     68
Seattle                                       80
Saint Louis                                   91
Toronto                                       68
Age at randomization (years)                  6.23, 8.81, 11.71
Male gender                                   0.65
Black race                                    0.10
Other (noncaucasian) race                     0.26
Randomized treatment
  Placebo                                     0.50
  Budesonide                                  0.32
  Nedocromil                                  0.17
IL10 variant allele                           0.50
Observations per subject                      9, 10, 10
Follow-up time (years)                        3.85, 3.99, 4.1
Post-bronchodilator percent predicted FVC     92, 105, 116



The covariates that represent the key biomedical questions include the
following: the binary time-invariant IL10 SNP ($\mathrm{snp}_i$); time since
randomization [$t_i = (t_{i1}, \ldots, t_{in_i})'$]; Budesonide
($\mathrm{bud}_i$) and Nedocromil ($\mathrm{ned}_i$) treatments (with placebo
being the reference); and pairwise interactions between IL10 and the other
variables. As described in Section 2.4, the imputation approaches required a
model for the predictor of interest, $X_{ei} = \mathrm{snp}_i$, in order to
impute its value for subjects not selected for the subsample ($S_i = 0$).
Therefore, the CD+MI analysis procedure required estimation of a marginal
exposure distribution (i.e., $[X_{ei} \mid X_{oi}, S_i = 1]$), and in that
model, $\mathrm{race}_i$, $\mathrm{site}_i$, $\mathrm{gender}_i$,
$\mathrm{bud}_i$ and $\mathrm{ned}_i$ were used as independent variables
($X_{oi}$) in an additive logistic regression model.
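Once the missing genotypes have been imputed several times, the per-imputation fits must be combined into one estimate and standard error. The pooling step is not restated in this section; the sketch below assumes the standard Rubin's rules are used, and the function name and any inputs are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and squared standard errors.

    Returns the pooled estimate and its standard error using
    total variance = within + (1 + 1/M) * between.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance across imputations
    total = within + (1.0 + 1.0 / m) * between
    return pooled, sqrt(total)
```

For a given replicate, `estimates` would hold one coefficient from each of the imputed-data fits and `variances` the corresponding squared standard errors.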

Table 3 shows CAMP regression summaries based on the original cohort
analysis using all subjects ($N = 555$), and on eight combinations of
subsampling designs with and without imputation, where only $N \approx 250$
children were included in a subsample. We provide the key summaries that
specifically address the primary research questions, but interested readers may




Table 3

CAMP results: estimated summaries and standard error estimates (in parentheses) based on 30 replications of each study design. At
each replication, twenty imputation samples were used for the CD+MI analyses. We do not include the standard errors for variance
components with the CD+MI approach because the lme4 package [Bates and Maechler (2010)] does not provide them. Although site
effects are not shown, they were included as fixed effects in regression analyses. The estimated mean row corresponds to the estimated
end-of-study mean value for the population of white, 12-year-old girls with VAs who were randomized to placebo treatment and who lived
in Baltimore. The original cohort column displays results from the analysis of the full cohort of 555 participants

                              Original        RS                                  ods.s                           ods.i                           ods.b
Variable                      cohort          CD              CD+MI               CD              CD+MI           CD              CD+MI           CD              CD+MI

Primary summaries

Budesonide (vs placebo) at all times
  No VAs                      -2.11 (1.16)    -1.57 (1.73)    -2.09 (1.46)        -3.65 (1.73)    -2.92 (1.45)    -2.39 (1.41)    -2.73 (1.29)    -2.65 (1.56)    -2.68 (1.34)
  With VAs                    3.29 (1.24)     3.08 (1.86)     3.08 (1.52)         4.12 (1.92)     4.18 (1.54)     3.99 (1.55)     3.95 (1.38)     3.51 (1.67)     3.69 (1.38)
  Difference                  5.40 (1.70)     4.65 (2.54)     5.17 (2.43)         7.78 (2.57)     7.10 (2.42)     6.39 (2.10)     6.68 (2.03)     6.16 (2.34)     6.37 (2.10)

Nedocromil (vs placebo) at all times
  No VAs                      -0.77 (1.17)    -0.62 (1.73)    -0.56 (1.46)        -2.96 (1.73)    -1.59 (1.46)    -1.11 (1.41)    -0.56 (1.24)    -2.16 (1.51)    -0.96 (1.31)
  With VAs                    0.69 (1.10)     0.73 (1.63)     0.54 (1.39)         0.42 (1.63)     1.45 (1.36)     0.31 (1.31)     0.64 (1.21)     -0.02 (1.36)    0.92 (1.20)
  Difference                  1.46 (1.61)     1.35 (2.39)     1.10 (2.36)         3.38 (2.41)     3.04 (2.33)     1.42 (1.94)     1.20 (1.87)     2.14 (2.08)     1.88 (1.95)

Time trend (per year) irrespective of treatment
  No VAs                      0.14 (0.16)     0.11 (0.23)     0.09 (0.19)         0.14 (0.17)     0.10 (0.16)     -0.04 (0.22)    0.09 (0.18)     0.19 (0.18)     0.13 (0.17)
  With VAs                    -0.25 (0.15)    -0.19 (0.23)    -0.19 (0.19)        -0.19 (0.17)    -0.21 (0.16)    -0.38 (0.22)    -0.21 (0.18)    -0.25 (0.18)    -0.24 (0.16)
  Difference                  -0.39 (0.22)    -0.30 (0.33)    -0.27 (0.31)        -0.33 (0.24)    -0.31 (0.24)    -0.35 (0.31)    -0.30 (0.29)    -0.44 (0.26)    -0.37 (0.25)

IL10 (VA vs no VA) in the placebo arm at baseline and year 4
  $t_{ij} = 0$                -1.65 (1.15)    -1.67 (1.70)    -1.78 (1.69)        -2.20 (1.65)    -2.13 (1.67)    -1.72 (1.30)    -1.90 (1.32)    -1.29 (1.46)    -1.50 (1.39)
  $t_{ij} = 4$                -3.20 (1.17)    -2.87 (1.73)    -2.88 (1.71)        -3.51 (1.68)    -3.35 (1.68)    -3.11 (1.5)     -3.12 (1.52)    -3.05 (1.52)    -2.98 (1.48)

Estimated mean                106.33 (1.51)   106.86 (2.32)   106.47 (1.64)       107.95 (2.20)   106.23 (1.62)   106.69 (1.99)   106.42 (1.63)   106.79 (1.97)   106.47 (1.59)

Other mean model parameters

Male (vs female)              -1.14 (0.72)    -1.47 (1.08)    -1.22 (0.73)        -1.47 (1.07)    -1.13 (0.72)    -0.71 (0.86)    -1.16 (0.72)    -1.19 (0.90)    -1.21 (0.72)
Black (vs white)              0.51 (1.21)     0.52 (1.87)     0.53 (1.25)         1.22 (1.85)     0.76 (1.23)     1.19 (1.56)     0.47 (1.24)     1.88 (1.50)     0.67 (1.23)
Other (vs white)              -0.81 (0.98)    -0.95 (1.44)    [illegible] (0.99)  -1.31 (1.44)    -0.59 (1.00)    -0.01 (1.15)    -0.71 (0.99)    -0.32 (1.20)    -0.61 (0.99)
Age ($t_{ij} = 0$)            -0.21 (0.17)    -0.23 (0.26)    -0.22 (0.17)        -0.40 (0.26)    -0.23 (0.17)    -0.50 (0.21)    -0.22 (0.17)    -0.39 (0.22)    -0.22 (0.17)

Variance components

$\log(\sigma_0)$              2.19            2.18 (0.05)     2.19                2.16 (0.05)     2.18            2.18 (0.04)     2.18            2.18 (0.04)     2.18
$\log(\sigma_1)$              0.84            0.85 (0.06)     0.84                0.84 (0.05)     0.84            0.83 (0.05)     0.84            0.84 (0.05)     0.84
$\log\{(1+\rho)/(1-\rho)\}$   -1.70           -1.13 (0.15)    -1.70               -1.06 (0.12)    -1.70           -1.11 (0.12)    -1.70           -1.08 (0.12)    -1.69
$\log(\sigma_e)$              1.55            1.54 (0.02)     1.55                1.60 (0.02)     1.55            1.60 (0.02)     1.55            1.62 (0.02)     1.55

look to online supplementary materials [Schildcrout et al. (2015)] for all
longitudinal model regression estimates and interactions used to generate
the summaries. Specifically, we focus on medication effects and time trends
within subgroups defined by presence or absence of an IL10 variant, the
difference in expected FVC between those with and without an IL10 variant
at baseline ($t_{ij} = 0$) and at the end of the study ($t_{ij} = 4$) for
subjects on placebo treatment, and the end-of-study predicted mean value.

In the original cohort analysis we observed the following associations that
were statistically significant at the $\alpha = 0.05$ level: (1) for subjects
with an IL10 variant, the expected FVC% was estimated to be 3.29 (1.24) units
higher across all times in those randomized to Budesonide compared to
placebo; (2) the effect of Budesonide compared to placebo was 5.40 (1.70)
units higher in those with an IL10 variant than in those without an IL10
variant; and (3) at the end of the study ($t_{ij} = 4$), those with an IL10
variant in the placebo arm were estimated to have FVC% values that were 3.20
(1.17) units lower than those without an IL10 variant. Our interest is in the
impact of subsampling design choices, so a natural option to consider is a
simple random sample. However, although the random sampling design produced
point estimates that were similar to results from the original cohort, none of
the full cohort-based associations would be considered statistically
significant using the RS design. In contrast, all ODS designs detected the
three significant effects seen in the original cohort, demonstrating the
potential efficiency gains through use of biased sampling in a
resource-limited environment.
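The significance pattern described above can be checked directly from the estimates and standard errors in Table 3. The sketch assumes a two-sided Wald test with the normal critical value 1.96 (the test used in the paper is not stated in this section) and uses the CD analysis columns for the subsampling designs; the names are illustrative.

```python
# (estimate, standard error) pairs copied from Table 3: Budesonide effect with
# VAs, Budesonide difference (VA vs no VA), and IL10 effect in placebo at year 4.
TABLE3_CD = {
    "original": [(3.29, 1.24), (5.40, 1.70), (-3.20, 1.17)],
    "RS": [(3.08, 1.86), (4.65, 2.54), (-2.87, 1.73)],
    "ods.s": [(4.12, 1.92), (7.78, 2.57), (-3.51, 1.68)],
    "ods.i": [(3.99, 1.55), (6.39, 2.10), (-3.11, 1.50)],
    "ods.b": [(3.51, 1.67), (6.16, 2.34), (-3.05, 1.52)],
}

Z_CRIT = 1.96  # assumed two-sided normal critical value for alpha = 0.05


def is_significant(estimate, se, z_crit=Z_CRIT):
    """Wald test: |estimate / se| exceeds the critical value."""
    return abs(estimate / se) > z_crit


flags = {
    design: [is_significant(est, se) for est, se in rows]
    for design, rows in TABLE3_CD.items()
}
```

Under this assumption all three contrasts are flagged for the original cohort and for each ODS design, and none is flagged for RS.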

Furthermore, for all designs the use of imputation (CD+MI analysis) improved
estimation efficiency of key parameters. For example, when sampling using
ods.b, the standard error for the Budesonide versus placebo contrast among
those with VAs was 1.67 under the CD analysis, 1.38 under the CD+MI analysis
and 1.24 for the original cohort analysis. Such efficiency gains due to MI
were also observed in all coefficient estimates for the other baseline
covariates measured on all subjects (e.g., age, race and gender). In contrast,
and consistent with simulations, CD+MI did not produce appreciably smaller
estimates of uncertainty than CD analyses for parameters that capture
(retrospectively ascertained) IL10 effects and interactions. For example,
under the ods.b design, the standard error estimate for the IL10 VA
association with FVC% in the placebo arm at $t_{ij} = 4$ was 1.52 and 1.48
with CD and CD+MI analyses, respectively. Similarly, the standard error
estimate for the difference in the time trends between those with and without
the IL10 VA was 0.26 and 0.25 with CD and CD+MI analyses, respectively.
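The standard errors quoted above can be put on the same scale as the simulation results by squaring their ratio, which gives an approximate variance-based relative efficiency. This is a back-of-the-envelope sketch using the ods.b entries of Table 3; the function and variable names are illustrative.

```python
def variance_ratio(se_reference, se_comparison):
    """Approximate relative efficiency as a ratio of squared standard errors."""
    return (se_reference / se_comparison) ** 2


# ods.b design, CD versus CD+MI (standard errors from Table 3).
mi_gain_bud_va = variance_ratio(1.67, 1.38)      # Budesonide effect, with VAs
mi_gain_il10_t4 = variance_ratio(1.52, 1.48)     # IL10 effect at year 4
mi_gain_time_diff = variance_ratio(0.26, 0.25)   # difference in time trends

# Share of the full-cohort information retained by ods.b with CD+MI
# for the Budesonide effect among those with VAs.
retained_bud_va = variance_ratio(1.24, 1.38)
```

The first ratio is about 1.46, whereas the two IL10-related ratios are about 1.05 and 1.08, which mirrors the contrast drawn in the text between fully observed and retrospectively ascertained covariates.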

Finally, for many parameters, the combination of subsampling and the
use of imputation was able to recover a large fraction of the information
present in the original cohort but with less than half the cost in terms of
number of subjects for whom covariates would be ascertained. For example,
all estimators produced quite similar estimates of the predicted mean value
at the end of the study, ranging from 106.33 to 107.95, and the ods.b plus
CD+MI combination estimated the standard error to be 1.59, only slightly
higher than the 1.51 estimated from the original cohort. In summary, the
CAMP analysis illustrates that targeted subsampling is typically more
efficient than simple random sampling, and that using all available data is
also beneficial and can be easily accomplished through imputation of data
for those subjects not selected in a given subsample. We recommend that
future ancillary studies of existing longitudinal cohorts consider the benefits
of directed sampling coupled with efficient analysis.

5. Discussion. The CAMP longitudinal clinical trial was conducted in
an era when genotyping was more expensive than today. Given ongoing
interest in treatment heterogeneity, a natural question is whether
treatment effectiveness varies across genotype. Because this would be a
secondary aim of most trials, it makes sense economically to conduct the
trial, obtain response trajectories and test for overall treatment effectiveness
first. Depending on what is learned through those primary investigations,
investigators, or their colleagues, may then wish to move ahead with other
exposure assessments to examine exposure effects or treatment-by-exposure
interactions. Such data could be used for confirmatory analyses or, more
likely, for pilot or preliminary data in an exploratory model. In these kinds
of settings especially, cost effectiveness is critical and can make the
difference between a study being viable or not.

To address such problems, in this manuscript we discussed novel statistical
approaches that combine ODS designs with efficient analyses for longitudinal
continuous response data. We observed that MI-based approaches can improve
efficiency dramatically over CD analyses for parameters corresponding to
estimation targets involving covariates that were not imputed (e.g.,
demographics and the estimated mean value in CAMP). Efficiency improvements
were more modest for the coefficients of imputed covariates (e.g., the VA by
time interaction under the ods.s design in CAMP), although such results can be
influenced by data features (e.g., effect size in simulation). Importantly, we
also observed that, even when MI is a default analytical choice, ODS designs
can still improve efficiency dramatically in targets associated (directly or
indirectly through interactions) with the retrospectively ascertained
covariate.

Because this manuscript discusses what we believe are new study designs, 
we were not able to analyze data directly from such a study. Such studies 
have yet to be conducted. Instead, to describe the characteristics of the 
designs and estimators, we replicated simulated substudies from CAMP. 
While this may not appear to be ideal at first, it allowed us to explore 
alternative CAMP substudy designs and did not lock us in to a single design. 




The two MI strategies, CD+MI and D-MI, approach parameter estimation
in somewhat different ways, even in the context of the overall MI framework.
Specifically, both approaches require careful consideration of two model
specifications. Whereas the outcome model $[Y_i \mid X_{ei}, X_{oi}]$ is common
to both strategies, CD+MI requires the direct specification of a marginal
exposure model $[X_{ei} \mid X_{oi}]$ and D-MI requires the direct
specification of the fully conditional exposure model
$[X_{ei} \mid X_{oi}, Y_i]$. We believe that each approach has an important
advantage. CD+MI may be considered advantageous because the marginal exposure
model is likely to be relatively simple as compared to the conditional
exposure model, and so the focus of analysis with CD+MI is on the outcome
model. The conditional exposure model that is directly specified with D-MI is
likely to involve additional consideration of the functional form of a
time-varying (response) variable toward prediction of a time-fixed exposure
variable. In contrast, D-MI may be considered more flexible because the
outcome and imputation models are decoupled. As compared to CD+MI, it could
potentially be more robust to misspecification of the outcome model.
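One way to see how the three model components relate is Bayes' theorem. The display below is a general identity for a binary exposure, written in the notation of this section with $\pi_i$ and $f$ introduced here for illustration; whether the CD+MI procedure of Section 2.4 uses exactly this form, including any adjustment for the sampling indicator $S_i$, is not restated here and should not be inferred from this sketch.

```latex
\begin{align*}
[X_{ei} \mid X_{oi}, Y_i] &\propto [Y_i \mid X_{ei}, X_{oi}]\,[X_{ei} \mid X_{oi}], \\
P(X_{ei} = 1 \mid X_{oi}, Y_i) &=
  \frac{f(Y_i \mid X_{ei} = 1, X_{oi})\,\pi_i}
       {f(Y_i \mid X_{ei} = 1, X_{oi})\,\pi_i
        + f(Y_i \mid X_{ei} = 0, X_{oi})\,(1 - \pi_i)},
  \qquad \pi_i = P(X_{ei} = 1 \mid X_{oi}).
\end{align*}
```

Read this way, specifying the outcome model and the marginal exposure model implies a conditional exposure model, whereas D-MI specifies the left-hand side directly.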

Finally, a rigorous evaluation of competing approaches [e.g., inverse
probability weighting (IPW)] is next in this line of research. A key reason we
have not pursued that here is that we are primarily interested in situations
wherein a full likelihood approach for both estimation and inference is of
interest. The IPW approaches step out of that paradigm, instead relying on
sandwich-type variance estimators, making the comparison among the methods
more complex. Other areas of future research that specifically pertain to the
imputation approaches involve extensions of the exposure variable to
continuous, ordinal and time-varying data. We also intend to explore
imbalanced time-varying covariates, unequal cluster sizes, general patterns of
missing data/dropout, mean model misspecification and imputation model
misspecification.
SUPPLEMENTARY MATERIAL

Supplement A: D-MI Derivation for the model used in simulation (DOI:
10.1214/15-AOAS826SUPPA; .pdf). Derivation of the D-MI imputation model
used in simulations (in Section 2.4.2).

Supplement B: CAMP Results: Parameter and uncertainty estimates
(DOI: 10.1214/15-AOAS826SUPPB; .pdf). Results from the CAMP analysis
that were used to derive the summaries in Table 3.